R has several core data structures:
Vectors form the basis of R data structures.
There are two main types, atomic vectors and lists, but I will treat lists separately.
Here is an R vector. The elements of the vector are numeric values.
x = c(1, 3, 2, 5, 4)
x
[1] 1 3 2 5 4
All elements of an atomic vector are the same type. Examples include:
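For instance, a quick sketch of the common atomic types, and the silent coercion that happens when types are mixed:

```r
# the most common atomic vector types
dbl = c(1.5, 2.5)      # double (numeric)
int = c(1L, 2L)        # integer
chr = c("a", "b")      # character
lgl = c(TRUE, FALSE)   # logical

# mixing types coerces everything to the most general type
c(1, "a")    # both elements become character
c(TRUE, 2)   # the logical becomes numeric 1
```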
An important type of vector is a factor.
Factors are used to represent categorical data.
x = factor(1:3, labels=c('q', 'V', 'what the heck?'))
x
[1] q V what the heck?
Levels: q V what the heck?
The underlying representation is numeric.
But, factors are categorical.
They can’t be used as numbers would be.
as.numeric(x)
[1] 1 2 3
sum(x)
Error in Summary.factor(structure(1:3, .Label = c("q", "V", "what the heck?"))) : 'sum' not meaningful for factors
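A related gotcha, sketched here as an aside: even when a factor's labels look like numbers, as.numeric returns the underlying codes; convert through as.character to recover the values.

```r
# a factor whose labels happen to be numbers
f = factor(c(10, 20, 30))

as.numeric(f)                # the underlying codes 1 2 3, not the values!
as.numeric(as.character(f))  # 10 20 30 -- convert via the labels instead
```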
With multiple dimensions, we are dealing with arrays.
Matrices are 2-d arrays, and extremely commonly used.
The vectors making up a matrix must all be of the same type.
Creating a matrix can be done in a variety of ways.
# create vectors
x = 1:4
y = 5:8
z = 9:12
rbind(x, y, z) # row bind
  [,1] [,2] [,3] [,4]
x 1 2 3 4
y 5 6 7 8
z 9 10 11 12
cbind(x, y, z) # column bind
     x y  z
[1,] 1 5 9
[2,] 2 6 10
[3,] 3 7 11
[4,] 4 8 12
matrix(c(x, y, z), nrow=3, ncol=4, byrow=TRUE)
     [,1] [,2] [,3] [,4]
[1,] 1 2 3 4
[2,] 5 6 7 8
[3,] 9 10 11 12
Lists in R are highly flexible objects.
They can contain anything as their elements, even other lists.
Here is a list. We use the list function to create one.
x = list(1, "apple", list(3, "cat"))
x
[[1]]
[1] 1
[[2]]
[1] "apple"
[[3]]
[[3]][[1]]
[1] 3
[[3]][[2]]
[1] "cat"
We often want to loop some function over a list.
for(elem in x) class(elem)
Lists can, and often do, have named elements.
x = list("a" = 25, "b" = -1, "c" = 0)
x["b"]$b
[1] -1
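Note what the different extraction operators return; a small sketch using the same list:

```r
x = list("a" = 25, "b" = -1, "c" = 0)

x["b"]    # single brackets: a list of length 1
x[["b"]]  # double brackets: the element itself
x$b       # same as [[, using a literal name

class(x["b"])    # "list"
class(x[["b"]])  # "numeric"
```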
data.frames are a very commonly used data structure.
Their columns do not all have to be of the same type.
This is because the data.frame class is actually just a list.
As such, everything about lists applies to data.frames.
But they can also be indexed by row or column.
mydf = data.frame(a = c(1,5,2),
                  b = c(3,8,1))
We can add row names also.
rownames(mydf) = paste0('row', 1:3)
mydf
     a b
row1 1 3
row2 5 8
row3 2 1
Standard methods of reading in data
Using the foreign package:
Note: the foreign package is no longer useful for Stata files.
haven: Package to read in foreign statistical files
readxl: for excel files
readr: Faster versions of base R functions
These make assumptions after an initial scan of the data.
If you don’t have ‘big’ data, this won’t help much.
However, they actually can be used as a diagnostic.
data.table: faster read.table
Typically faster than readr approaches.
Note that R can handle many types of data.
Some examples:
And many, many others.
feather: designed to make reading and writing data frames efficient
Works in both Python and R.
Still in early stages of development.
Slicing vectors
letters[4:6]
[1] "d" "e" "f"
letters[c(13,10,3)]
[1] "m" "j" "c"
Slicing matrices/data.frames
myMatrix[1, 2:3]
Label-based indexing:
mydf['row1', 'b']
Position-based indexing:
mydf[1, 2]
Mixed indexing:
mydf['row1', 2]
If the row/column value is empty, all rows/columns are retained.
mydf['row1',]
mydf[,'b']
Non-contiguous:
mydf[c(1,3),]
Boolean:
mydf[mydf$a >= 2,]
List/data.frame extraction
[ : grab a slice of elements/columns
[[ : grab specific elements/columns
$ : grab specific elements/columns
my_list_or_df[2:4]
my_list_or_df[['name']]
my_list_or_df$name
Logicals are objects with values of TRUE or FALSE.
Assume x is a vector of numbers.
idx = x > 2
idx
x[idx]
We don’t have to create a Boolean object before using it.
R indexing is ridiculously flexible.
x[x > 2]
x[x != 3]
x[ifelse(x > 2, T, F)]
x[{y = idx; y}]
Consider the following loop:
for (i in 1:nrow(mydf)) {
  check = mydf$x[i] > 2
  if (check) {
    mydf$y[i] = 'Yes'
  } else {
    mydf$y[i] = 'No'
  }
}
Compare:
mydf$y = 'No'
mydf$y[mydf$x > 2] = 'Yes'
This gets us the same thing, and would be much faster.
Boolean indexing is an example of a vectorized operation.
The whole vector is considered.
This is almost always faster than looping over the elements.
Log all values in a matrix.
mymatrix_log = log(mymatrix)
Way faster than looping over elements, rows or columns.
Many vectorized functions already exist in R.
They are often written in C, Fortran etc., and so even faster.
A family of functions allows for a succinct way of looping.
Common ones include:
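As a small sketch of a few of them — apply over matrix margins, sapply over vectors/lists, and tapply over groups:

```r
m = matrix(1:6, nrow=2)

apply(m, 1, sum)    # apply over rows
apply(m, 2, sum)    # apply over columns

sapply(1:3, function(i) i^2)   # over a vector, simplifying the result

tapply(c(1, 2, 3, 4), c('a', 'b', 'a', 'b'), sum)   # sums within groups
```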
Standardizing variables.
for (i in 1:ncol(mydf)){
  x = mydf[,i]
  for (j in 1:length(x)){
    x[j] = (x[j] - mean(x))/sd(x)
  }
}
The above would be a really bad way to use R (it isn't even correct: mean(x) and sd(x) shift as x is overwritten element by element).
stdize <- function(x) {
(x-mean(x))/sd(x)
}
apply(mydf, 2, stdize)
The previous demonstrates how to use apply.
However, there is a scale function in base R.
Unit: milliseconds
expr min lq mean median uq max neval
doubleloop 3022.709630 3034.807864 3096.19295 3065.54156 3089.56331 3397.40877 25
singleloop 30.183746 31.827583 34.86938 32.96380 33.91120 70.31997 25
plyr 124.389190 127.980656 133.05413 129.47050 133.70827 168.24146 25
apply 33.302114 34.240506 35.96508 34.99698 36.29230 46.64223 25
parApply 18.038382 19.397640 22.72359 21.36238 22.41544 63.38740 25
vectorized 8.080799 9.737545 13.68819 10.17552 11.80675 54.61436 25
Benefits
They are NOT necessarily faster than explicit loops.
But they can ALWAYS potentially be made faster (e.g., via parallelization).
I use R every day, and rarely use explicit loops.
I never use a double loop.
Apply functions should be a part of your regular R experience.
Other versions we’ll talk about have been optimized.
However, you need to know the basics in order to use those.
And you may still need parallel versions.
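As a sketch, the base R parallel package provides parallel analogues of the apply family:

```r
library(parallel)

cl = makeCluster(2)                        # a small local cluster
res = parSapply(cl, 1:4, function(i) i^2)  # parallel analogue of sapply
stopCluster(cl)

res   # 1 4 9 16
```

For a toy task like this the cluster overhead dwarfs any gain; the payoff comes with expensive per-element work.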
Note:
More detail on much of this part is given in another workshop.
Operators that send what comes before to what comes after.
There are many different pipes.
There are many packages that use their own.
However, the vast majority of packages use the same pipe: %>%, from the magrittr package.
Here, we’ll focus on their use with the dplyr package.
Later, we’ll use it for visualizations.
Example.
mydf %>%
select(var1, var2) %>%
filter(var1 == 'Yes') %>%
summary
Start with a data.frame %>%
    select columns from it %>%
    filter/subset it %>%
    get a summary
We can use variables as soon as they are created.
mydf %>%
mutate(newvar1 = var1 + var2,
newvar2 = newvar1/var3) %>%
summarise(newvar2avg = mean(newvar2))
Generic example.
basegraph %>%
points %>%
lines %>%
layout
Most functions are not ‘pipe-aware’ by default.
Example: pipe to a modeling function.
mydf %>%
lm(y ~ x) # error
Other pipes can handle this.
But generally, one can use a dot.
mydf %>%
lm(y ~ x, data=.)
Piping is not just for data.frames.
c('Ceci', "n'est", 'pas', 'une', 'pipe!') %>%
  {
    .. <- . %>%
      if (length(.) == 1) .
      else paste(.[1], '%>%', ..(.[-1]))
    ..(.)
  }
[1] "Ceci %>% n'est %>% pas %>% une %>% pipe!"
Pipes are best used interactively.
Extremely useful for data exploration.
Common in many visualization packages.
See the magrittr package for more pipes.
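For instance, beyond the basic forward pipe, magrittr also provides a compound assignment pipe, %<>%; a quick sketch:

```r
library(magrittr)

c(1, 4, 9) %>% sqrt %>% sum   # forward pipe: sqrt, then sum

x = c(1, 4, 9)
x %<>% sqrt   # compound assignment: pipe x through sqrt, then overwrite x
x             # now 1 2 3
```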
Original data management package of the three.
More general than dplyr.
Not as useful for most common operations, but contains:
adply, dlply etc.
library(plyr)
x = list(var1=1:5, var2=2:6)
ldply(x)
  .id V1 V2 V3 V4 V5
1 var1 1 2 3 4 5
2 var2 2 3 4 5 6
ldply(x, sum)
  .id V1
1 var1 15
2 var2 20
Option to parallelize.
*ply: apply style functions, with parallel capability
join_all: Recursively join a list of data frames
rbind.fill: row bind data.frames, filling in missing columns.
mapvalues/revalue: replace values
round_any: Round to multiple of any number.
Grammar of data manipulation.
Next iteration of plyr.
Focused on tools for working with data frames.
It has three main goals:
Make the most important data manipulation tasks easier.
Do them faster.
Use the same interface to work with data frames, data tables, or databases.
Some key operations:
select: grab columns
filter/slice: grab rows
group_by: grouped operations
mutate/transmute: create new variables
summarize: summarize/aggregate
do: arbitrary operations
Various join/merge functions.
Little things like:
No need to quote variable names.
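As a minimal sketch of several of these verbs working together (the data frame here is made up for illustration):

```r
library(dplyr)

mydf = data.frame(g = c('a', 'a', 'b', 'b'),
                  x = 1:4)

mydf %>%
  filter(x > 1) %>%              # grab rows
  group_by(g) %>%                # grouped operations
  summarize(mean_x = mean(x))    # aggregate within groups
```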
Let’s say we want to select from our data the following variables:
How might we go about this?
Tedious, or typically two steps just to get the columns you want.
# numeric indexes; not conducive to readability or reproducibility
newData = oldData[,c(1,2,3,4, etc.)]
# explicitly by name; fine if only a handful; not pretty
newData = oldData[,c('ID','X1', 'X2', etc.)]
# two step with grep; regex difficult to read/understand
cols = c('ID', paste0('X', 1:10), 'var1', 'var2', grep('^XYZ', colnames(oldData), value=T))
newData = oldData[,cols]
# or via subset
newData = subset(oldData, select = cols)
What if you also want observations where Z is Yes, Q is No, and only the observations with the top 50 values of var2, ordered by var1 (descending)?
# three operations and overwriting or creating new objects if we want clarity
newData = newData[oldData$Z == 'Yes' & oldData$Q == 'No',]
newData = newData[order(newData$var2, decreasing=T)[1:50],]
newData = newData[order(newData$var1, decreasing=T),]
And this is for fairly straightforward operations.
newData = oldData %>%
filter(Z == 'Yes', Q == 'No') %>%
select(num_range('X', 1:10), contains('var'), starts_with('XYZ')) %>%
top_n(n=50, wt=var2) %>%
arrange(desc(var1))
dplyr and piping offer an alternative.
Even though the initial base R approach depicted is fairly concise, it still can potentially be:
Two primary functions for manipulating data: gather and spread.
Other useful functions include:
library(tidyr)
stocks <- data.frame( time = as.Date('2009-01-01') + 0:9,
X = rnorm(10, 0, 1),
Y = rnorm(10, 0, 2),
Z = rnorm(10, 0, 4) )
stocks %>% head
        time          X           Y          Z
1 2009-01-01 0.6370511 -0.01400084 -1.6791027
2 2009-01-02 -1.2042548 -0.18505288 0.4843967
3 2009-01-03 0.8386316 0.56394319 4.4498663
4 2009-01-04 0.4729477 -2.83158426 -1.5725511
5 2009-01-05 0.2300034 1.19861116 2.6563916
6 2009-01-06 -0.5510324 -0.71985395 -4.3385482
stocks %>% gather(stock, price, -time) %>% head
        time stock      price
1 2009-01-01 X 0.6370511
2 2009-01-02 X -1.2042548
3 2009-01-03 X 0.8386316
4 2009-01-04 X 0.4729477
5 2009-01-05 X 0.2300034
6 2009-01-06 X -0.5510324
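spread goes the other direction, turning the long data back into one column per stock; a small sketch with made-up values:

```r
library(tidyr)

long = data.frame(time  = rep(1:2, times=2),
                  stock = rep(c('X', 'Y'), each=2),
                  price = c(1.1, 1.2, 2.1, 2.2))

spread(long, stock, price)   # columns: time, X, Y
```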
The dplyr grammar is clear for a lot of standard data processing tasks, and some not so common.
Extremely useful for data exploration and visualization.
Drawbacks:
multidplyr
Partitions the data across a cluster.
Faster than data.table (after partitioning)
data.table works in a notably different way than dplyr.
However, you’d use it for the same reasons.
Like dplyr, the data objects are both data.frames and a package-specific class.
Faster subset, grouping, update, ordered joins and list columns
In general, data.table works with brackets as in base R.
However, the brackets work like a function call!
x[i, j, by, keyby, with = TRUE, ...]
Importantly:
you can’t use the brackets as you would with data.frames.
library(data.table)
df = data.table(x=sample(1:10, 6), g=1:3, y=runif(6))
df[,4]
[1] 4
x[i, j, by, keyby, with = TRUE, ...]
What i and j can be is fairly complex.
In general, you use i for filtering by rows.
df[2]
   x g         y
1: 3 2 0.6004823
df[2,]
   x g         y
1: 3 2 0.6004823
x[i, j, by, keyby, with = TRUE, ...]
In general, you use j to select (by name!) or create new columns.
df[,x]
[1] 1 3 10 6 4 8
df[,z:=x+y] # df now has a new column
x g y z
1: 1 1 0.7955191 1.795519
2: 3 2 0.6004823 3.600482
3: 10 3 0.1054822 10.105482
4: 6 1 0.9908635 6.990863
5: 4 2 0.5685444 4.568544
6: 8 3 0.9724441 8.972444
Dropping columns is awkward.
df[,-y] # doesn't drop y; just returns negative values of y
[1] -0.7955191 -0.6004823 -0.1054822 -0.9908635 -0.5685444 -0.9724441
df[,-'y', with=F] # drops y, but now needs quotes
df[,y:=NULL] # drops y by reference
df$y = NULL # the base R approach also works
x g z
1: 1 1 1.795519
2: 3 2 3.600482
3: 10 3 10.105482
4: 6 1 6.990863
5: 4 2 4.568544
6: 8 3 8.972444
x g z
1: 1 1 1.795519
2: 3 2 3.600482
3: 10 3 10.105482
4: 6 1 6.990863
5: 4 2 4.568544
6: 8 3 8.972444
group-by, with creation of a new variable.
Note that these actually modify df in place.
df1 = df2 = df
df[,sum(x,y), by=g] # sum of all x and y values
   g V1
1: 1 33
2: 2 33
3: 3 44
df1[,newvar := sum(x), by=g] # add new variable to the original data
    x g         z newvar
1: 1 1 1.795519 7
2: 3 2 3.600482 7
3: 10 3 10.105482 18
4: 6 1 6.990863 7
5: 4 2 4.568544 7
6: 8 3 8.972444 18
df1
    x g         z newvar
1: 1 1 1.795519 7
2: 3 2 3.600482 7
3: 10 3 10.105482 18
4: 6 1 6.990863 7
5: 4 2 4.568544 7
6: 8 3 8.972444 18
We can also create groupings on the fly.
For a new summary data set, we’ll take the following approach.
df2[, list(meanx = mean(x), sumx = sum(x)), by=g==1]
       g meanx sumx
1: TRUE 3.50 7
2: FALSE 6.25 25
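The df1[df2] syntax below is a keyed join. As a self-contained sketch (with made-up tables):

```r
library(data.table)

dt1 = data.table(id = c(1, 2, 3), x = c(10, 20, 30))
dt2 = data.table(id = c(2, 3),    y = c('a', 'b'))

setkey(dt1, id)
setkey(dt2, id)

dt1[dt2]   # rows of dt1 matching dt2 on the key, with dt2's columns appended
```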
df1[df2]
The following demonstrates some timings from here.
By the way, never, ever use aggregate. For anything.
fun elapsed
1: aggregate 114.35
2: by 24.51
3: sapply 11.62
4: tapply 11.33
5: dplyr 10.97
6: lapply 10.65
7: data.table 2.71
Ever.
Really.
Chaining multiple operations can be done, but it's awkward at best.
mydf[,newvar:=mean(x),][,newvar2:=sum(newvar), by=group][,-'y', with=FALSE]
mydf[,newvar:=mean(x),
][,newvar2:=sum(newvar), by=group
][,-'y', with=FALSE
]
Probably better to just use a pipe and dot approach.
mydf[,newvar:=mean(x),] %>%
.[,newvar2:=sum(newvar), by=group] %>%
.[,-'y', with=FALSE]
Faster methods are great to have.
Drawbacks:
If speed and/or memory is (potentially) a concern, use data.table.
For interactive exploration, use dplyr.
Piping allows one to use both, so no need to choose.
ggplot2 is an extremely popular package for visualization in R.
It entails a grammar of graphics.
Key ideas:
Strengths:
Aesthetics allow one to map data to aesthetic aspects of the plot.
The function used in ggplot to do this is aes.
aes(x=myvar, y=myvar2, color=myvar3, group=g)
In general, we start with a base layer and add to it.
In most cases you’ll start as follows.
ggplot(aes(x=myvar, y=myvar2), data=mydata)
This would just produce a plot background.
Layers are added via piping.
The first layers added are typically geoms:
ggplot2 was using pipes before it was cool, and so it has a different pipe.
Otherwise, the concept is the same as before.
ggplot(aes(x=myvar, y=myvar2), data=mydata) +
geom_point()
And now we would have a scatterplot.
library(ggplot2)
data("diamonds"); data('economics')
ggplot(aes(x=carat, y=price), data=diamonds) +
geom_point()
ggplot(aes(x=date, y=unemploy), data=economics) +
geom_line()
In the following, one setting is not mapped to the data.
ggplot(aes(x=carat, y=price), data=diamonds) +
geom_point(aes(size=carat, color=clarity), alpha=.25)
There are many statistical functions built in.
Key strength: you don’t have to do much preprocessing.
Quantile regression lines:
ggplot(mpg, aes(displ, hwy)) +
geom_point() +
geom_quantile()
Loess (or additive model) smooth:
ggplot(mpg, aes(displ, hwy)) +
geom_point() +
geom_smooth()
Bootstrapped confidence intervals:
ggplot(mtcars, aes(cyl, mpg)) +
geom_point() +
stat_summary(fun.data = "mean_cl_boot", colour = "orange", alpha=.75, size = 1)
Facets allow for paneled display, a very common operation.
In general, we often want comparison plots.
facet_grid will produce a grid.
facet_wrap is more flexible.
Both use a formula approach to specify the grouping.
ggplot(mtcars, aes(wt, mpg)) +
geom_point() +
facet_grid(vs ~ cyl, labeller = label_both)
ggplot(mtcars, aes(wt, mpg)) +
geom_point() +
facet_wrap(vs ~ cyl, labeller = label_both, ncol=2)
ggplot2 makes it easy to get good-looking graphs quickly.
However, the amount of fine control is extensive.
ggplot(aes(x=carat, y=price), data=diamonds) +
geom_point(aes(color=clarity), alpha=.5) +
scale_y_log10(breaks=c(1000,5000,10000)) +
xlim(0, 10) +
scale_color_brewer(type='div') +
facet_wrap(~cut, ncol=3) +
theme_minimal() +
theme(axis.ticks.x=element_line(color='darkred'),
axis.text.x=element_text(angle=-45),
axis.text.y=element_text(size=20),
strip.text=element_text(color='forestgreen'),
strip.background=element_blank(),
panel.grid.minor=element_line(color='lightblue'),
legend.key=element_rect(linetype=4),
legend.position='bottom')
In the last example you saw two uses of a theme.
Each argument takes on a specific value or an element function:
The base theme is not too good.
You will almost invariably need to tweak it.
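A sketch of a typical tweak — text parts take element_text, lines take element_line, and some arguments take plain values:

```r
library(ggplot2)

p = ggplot(mtcars, aes(wt, mpg)) +
  geom_point() +
  theme_minimal() +                                    # swap the base theme wholesale
  theme(axis.title      = element_text(size = 14),     # an element_text argument
        panel.grid      = element_line(color = 'gray90'),  # an element_line argument
        legend.position = 'bottom')                    # a plain-value argument
```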
ggplot2 now has its own extension system.
There is even a website to track the extensions.
Examples include:
ggplot2 is an easy to use, but powerful visualization tool.
Allows one to think in many dimensions for any graph:
2d graphs are only useful for conveying the simplest of ideas.
Use ggplot2 to easily create more interesting visualizations.
ggplot2 is the most widely used package for visualization in R.
However, it is not interactive by default.
Many packages use htmlwidgets, d3 (JavaScript library) etc. to provide interactive graphics.
General:
Specific functionality:
One of the advantages to piping is that it’s not limited to dplyr style data management functions.
Any R function can be potentially piped to.
This facilitates data exploration, especially visually.
Many newer visualization packages take advantage of piping.
htmlwidgets is a package that makes it easy to create javascript visualizations.
The packages using it typically are pipe-oriented and produce interactive plots.
A couple demonstrations with plotly.
Note the layering as with ggplot2.
Piping used before plotting.
library(plotly)
midwest %>%
filter(inmetro==T) %>%
plot_ly(x=percollege, y=percbelowpoverty, mode='markers')
plotly has modes, which allow for points, lines, text and combinations.
Traces work similar to geoms.
library(mgcv)
mtcars %>%
mutate(amFactor = factor(am, labels=c('auto', 'manual')),
hovertext = paste(wt, mpg, amFactor),
prediction = predict(gam(mpg~s(wt), data=mtcars))) %>%
arrange(wt) %>%
plot_ly(x=wt, y=mpg, color=amFactor, width=800, height=500, mode='markers') %>%
add_trace(x=wt, y=prediction, alpha=.5, hover=hovertext, name='gam prediction')
The nice thing about plotly is that we can feed a ggplot to it.
It would have been easier to use geom_smooth, so let’s do so.
gp = mtcars %>%
mutate(amFactor = factor(am, labels=c('auto', 'manual')),
hovertext = paste(wt, mpg, amFactor),
prediction = predict(gam(mpg~s(wt), data=mtcars))) %>%
arrange(wt) %>%
ggplot(aes(x=wt, y=mpg)) +
geom_smooth() +
geom_point(aes(color=amFactor))
ggplotly(width='auto')
dygraphs is useful for time-series.
library(dygraphs)
data(UKLungDeaths)
cbind(ldeaths, mdeaths, fdeaths) %>%
dygraph(width=800) %>%
dyOptions(stackedGraph = TRUE, colors=RColorBrewer::brewer.pal(3, name='Dark2')) %>%
dyRangeSelector(height = 20)
visNetwork allows for network visualizations.
library(visNetwork)
visNetwork(nodes, edges, height=600, width=800) %>%
visNodes(shape='circle',
font=list(),
scaling=list(min=10, max=50, label=list(enable=T))) %>%
visLegend()
Use the DT package for interactive dataframes.
library(DT)
movies %>%
select(1:6) %>%
filter(rating>9) %>%
slice(sample(1:nrow(.), 50)) %>%
datatable(rownames=F)
Shiny is a framework that can essentially allow you to build an interactive website.
Most of the more recently developed visualization packages will work specifically within the shiny and rmarkdown settings.
Interactivity allows for even more dimensions to be brought to a graphic.
Interactive graphics are more fun too!
Just a couple visualization packages can go a very long way.
With the right tools, data exploration can be:
Use them to wring your data dry of what it has to offer.